Recognition of printed arabic text based on global features and decision tree learning techniques
نویسنده
چکیده
Machine simulation of human reading has been the subject of intensive research for almost three decades. A large number of research papers and reports have already been published on Latin, Chinese and Japanese characters. However, little work has been conducted on the automatic recognition of Arabic in both on-line and o!-line, has been achieved towards the automatic recognition of Arabic characters. This is a result of the lack of adequate support in terms of funding, and other utilities such as Arabic text databases, dictionaries, etc., and of course because of the cursive nature of its writing rules, and this problem is still an open research "eld. This paper presents a new technique for the recognition of Arabic text using the C4.5 machine learning system. The advantage of machine learning are twofold: it can generalize over the large degree of variations between di!erent fonts and writing style and recognition rules can be constructed by examples. The technique can be divided into three major steps. The "rst step is digitization and pre-processing to create connected component, detect the skew of a document image and correct it. Second, feature extraction, where global features of the input Arabic word is used to extract features such as number of subwords, number of peaks within the subword, number and position of the complementary character, etc., to avoid the di$culty of segmentation stage. Finally, machine learning C4.5 is used to generate a decision tree for classifying each word. The system was tested with 1000 Arabic words with di!erent fonts (each word has 15 samples) and the correct average recognition rate obtained using cross-validation was 92%. ( 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.
منابع مشابه
Off-line Arabic Handwritten Recognition Using a Novel Hybrid HMM-DNN Model
In order to facilitate the entry of data into the computer and its digitalization, automatic recognition of printed texts and manuscripts is one of the considerable aid to many applications. Research on automatic document recognition started decades ago with the recognition of isolated digits and letters, and today, due to advancements in machine learning methods, efforts are being made to iden...
متن کاملروشی جدید جهت استخراج موجودیتهای اسمی در عربی کلاسیک
In Natural Language Processing (NLP) studies, developing resources and tools makes a contribution to extension and effectiveness of researches in each language. In recent years, Arabic Named Entity Recognition (ANER) has been considered by NLP researchers due to a significant impact on improving other NLP tasks such as Machine translation, Information retrieval, question answering, query result...
متن کاملDocument Analysis And Classification Based On Passing Window
In this paper we present Document analysis and classification system to segment and classify contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in the preprocessing by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...
متن کاملUsing Machine Learning Algorithms for Automatic Cyber Bullying Detection in Arabic Social Media
Social media allows people interact to express their thoughts or feelings about different subjects. However, some of users may write offensive twits to other via social media which known as cyber bullying. Successful prevention depends on automatically detecting malicious messages. Automatic detection of bullying in the text of social media by analyzing the text "twits" via one of the machine l...
متن کاملClassification of encrypted traffic for applications based on statistical features
Traffic classification plays an important role in many aspects of network management such as identifying type of the transferred data, detection of malware applications, applying policies to restrict network accesses and so on. Basic methods in this field were using some obvious traffic features like port number and protocol type to classify the traffic type. However, recent changes in applicat...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Pattern Recognition
دوره 33 شماره
صفحات -
تاریخ انتشار 2000